GitHub Repository: debakarr/machinelearning
Path: blob/master/Part 7 - Natural Language Processing/[Python] Natural Language Processing.ipynb
Kernel: Python 3

Natural Language Processing

Data Preprocessing

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re  # for regex
# import nltk  # The Natural Language Toolkit library
# nltk.download('stopwords')  # download the stopwords corpus once
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer  # for tokenization

%matplotlib inline
plt.rcParams['figure.figsize'] = [14, 8]
dataset = pd.read_table('Restaurant_Reviews.tsv')  # tsv stands for tab-separated values
dataset.head(10)
len(dataset)
1000
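Side note: some copies of this file contain double quotes inside the reviews, which can confuse the parser. If that happens, a common workaround (an alternative to the read_table call above, not what this notebook ran) is to disable quote handling entirely:

# quoting = 3 means csv.QUOTE_NONE, i.e. treat quote characters as ordinary text
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)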

Cleaning the texts

corpus = []
for i in range(0, 1000):
    # Keep only alphabetic characters and replace any other character with a whitespace
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    # Change everything to lowercase
    review = review.lower()
    # Remove the non-significant words, e.g. 'the', 'a', 'an', 'in', 'on',
    # i.e. the articles and prepositions, then apply stemming
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if word not in stopwords.words('english')]
    # Change the list back to a sentence
    review = ' '.join(review)
    # Append the newly generated sentence to the corpus
    corpus.append(review)
dataset.head(10)
corpus[0:10]
['wow love place',
 'crust good',
 'tasti textur nasti',
 'stop late may bank holiday rick steve recommend love',
 'select menu great price',
 'get angri want damn pho',
 'honeslti tast fresh',
 'potato like rubber could tell made ahead time kept warmer',
 'fri great',
 'great touch']
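To make the cleaning steps concrete, here is the first review walked through the same pipeline one operation at a time (an illustrative sketch using the exact operations of the loop above):

sample = dataset['Review'][0]             # 'Wow... Loved this place.'
step1 = re.sub('[^a-zA-Z]', ' ', sample)  # punctuation replaced by spaces
step2 = step1.lower().split()             # ['wow', 'loved', 'this', 'place']
ps = PorterStemmer()
step3 = [ps.stem(w) for w in step2 if w not in stopwords.words('english')]
print(' '.join(step3))                    # 'wow love place' -- 'this' dropped, 'loved' stemmed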

Creating the Bag of Words model

cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
X.shape
(1000, 1500)
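Each of the 1500 columns of X counts one word of the vocabulary that CountVectorizer kept (the 1500 most frequent tokens in the corpus). To peek at a few of them (note: on scikit-learn versions before 1.0 the method is called get_feature_names instead):

# First ten words of the learned vocabulary, in alphabetical order
print(cv.get_feature_names_out()[:10])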
y = dataset.iloc[:, 1].values
y[0:10]
array([1, 0, 0, 1, 1, 0, 0, 0, 1, 1])

Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
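This dataset is balanced (500 positive and 500 negative reviews), so a plain random split is usually fine; if you want to guarantee the same class balance in both splits, train_test_split also accepts a stratify argument:

# Optional variant: stratify = y preserves the class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.25, random_state = 42, stratify = y)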

Fitting Naive Bayes to the Training set

from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
GaussianNB(priors=None)
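GaussianNB assumes each feature is a normally distributed continuous value, which is a rough fit for sparse word counts. MultinomialNB models the counts directly and is the usual textbook choice for Bag of Words features, so it is worth comparing on the same split (a variant to try, not part of the original notebook):

# MultinomialNB works on the raw count features and often beats GaussianNB here
from sklearn.naive_bayes import MultinomialNB
classifier_mnb = MultinomialNB()
classifier_mnb.fit(X_train, y_train)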

Predicting the Test set results

y_pred = classifier.predict(X_test)

Making the Confusion Matrix

from sklearn.metrics import confusion_matrix
cm_nb = confusion_matrix(y_test, y_pred)
cm_nb
array([[ 66,  62],
       [ 18, 104]])
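In scikit-learn's convention the rows of the confusion matrix are the actual classes and the columns the predicted ones, so this run gives 66 true negatives, 62 false positives, 18 false negatives and 104 true positives. The per-class metrics computed by hand in the Homework section below can also be cross-checked with the built-in summary:

# Built-in per-class Precision, Recall and F1 summary
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))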

Homework

1. Run the other classification models we made in Part 3 - Classification, other than the one we used in the last tutorial.

Decision Tree

# Fitting Decision Tree Classification to the Training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm_dt = confusion_matrix(y_test, y_pred)
cm_dt
array([[94, 34],
       [50, 72]])

Random Forest Classification

# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm_rf = confusion_matrix(y_test, y_pred)
cm_rf
array([[113,  15],
       [ 56,  66]])
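A tuning note: n_estimators = 10 is quite small (newer scikit-learn versions default to 100), and random forests generally become more stable with more trees. A variant worth trying; results will differ from the run above:

# More trees usually reduces variance across runs
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)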

2. Evaluate the performance of each of these models. Try to beat the Accuracy obtained in the tutorial. But remember, Accuracy is not enough, so you should also look at other performance metrics such as Precision (measuring exactness), Recall (measuring completeness) and the F1 Score (a compromise between Precision and Recall). The formulas for these metrics are given below (TP = # True Positives, TN = # True Negatives, FP = # False Positives, FN = # False Negatives):

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 Score = 2 * Precision * Recall / (Precision + Recall)
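The three cells below apply these formulas to each confusion matrix by hand. The same arithmetic can also be wrapped once in a small helper function (a hypothetical convenience, not part of the original notebook):

def report(cm, name):
    # Unpack the 2x2 confusion matrix: rows are actual, columns are predicted
    tn, fp = cm[0]
    fn, tp = cm[1]
    accuracy = (tp + tn) / np.sum(cm)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(name, '-> Accuracy:', accuracy, 'Precision:', precision,
          'Recall:', recall, 'F1 Score:', f1)

# Usage: report(cm_nb, 'Naive Bayes')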

Accuracy, Precision, Recall, F1 Score of Naive Bayes

A = (cm_nb[0][0] + cm_nb[1][1]) / np.sum(cm_nb)
P = cm_nb[1][1] / (cm_nb[1][1] + cm_nb[0][1])
R = cm_nb[1][1] / (cm_nb[1][1] + cm_nb[1][0])
print('Accuracy of Naive Bayes:', A)
print('Precision of Naive Bayes:', P)
print('Recall of Naive Bayes:', R)
print('F1 Score of Naive Bayes:', 2 * P * R / (P + R))
Accuracy of Naive Bayes: 0.68
Precision of Naive Bayes: 0.626506024096
Recall of Naive Bayes: 0.852459016393
F1 Score of Naive Bayes: 0.722222222222

Accuracy, Precision, Recall, F1 Score of Decision Tree

A = (cm_dt[0][0] + cm_dt[1][1]) / np.sum(cm_dt)
P = cm_dt[1][1] / (cm_dt[1][1] + cm_dt[0][1])
R = cm_dt[1][1] / (cm_dt[1][1] + cm_dt[1][0])
print('Accuracy of Decision Tree:', A)
print('Precision of Decision Tree:', P)
print('Recall of Decision Tree:', R)
print('F1 Score of Decision Tree:', 2 * P * R / (P + R))
Accuracy of Decision Tree: 0.664
Precision of Decision Tree: 0.679245283019
Recall of Decision Tree: 0.590163934426
F1 Score of Decision Tree: 0.631578947368

Accuracy, Precision, Recall, F1 Score of Random Forest

A = (cm_rf[0][0] + cm_rf[1][1]) / np.sum(cm_rf)
P = cm_rf[1][1] / (cm_rf[1][1] + cm_rf[0][1])
R = cm_rf[1][1] / (cm_rf[1][1] + cm_rf[1][0])
print('Accuracy of Random Forest:', A)
print('Precision of Random Forest:', P)
print('Recall of Random Forest:', R)
print('F1 Score of Random Forest:', 2 * P * R / (P + R))
Accuracy of Random Forest: 0.716
Precision of Random Forest: 0.814814814815
Recall of Random Forest: 0.540983606557
F1 Score of Random Forest: 0.650246305419
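Taken together: Random Forest has the best Accuracy (0.716) and by far the best Precision (0.815), Naive Bayes has the best Recall (0.852) and the best F1 Score (0.722), and the Decision Tree does not lead on any metric. Which model "beats" the tutorial therefore depends on which metric matters for the application.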